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Abstract 

A Monte Carlo simulation study was conducted to determine the 
bootstrap correction formula yielding the most accurate 
confidence intervals for robust measures of association. 
Confidence intervals were generated via the percentile, 
adjusted, BC, and BC a bootstrap procedures and applied to the 
Winsorized, percentage bend, and Pearson correlation 
coefficients. Type I error, bias, efficiency, and interval 
length were compared across correlational and bootstrap methods. 
Results revealed the superior resiliency of the robust measures 
over the Pearson r, though neither robust correlation 
outperformed the other. Unexpectedly, the four bootstrap 
techniques achieved roughly equivalent outcomes. Based on the 
these results, it appears that the more complex bootstrapping 
procedures may not be worth the additional computational 
expenditures . 




3 



Bootstrapping 



3 



Bootstrapping Confidence Intervals for 
Robust Measures of Association 

The purpose of this study was to compare bootstrapping 
approaches to estimating confidence intervals for various 
correlation coefficients. The paper initially describes the 
concept of robustness, robust and non-robust measures of 
association, and bootstrapping procedures. Next, the simulation 
method is detailed, including a discussion of indices used to 
compare the various measures and procedures. Finally, results 
conclusions, and recommendations are offered. 

The Concept of Robustness 

Classical statistical procedures are often inadequate due 
to their sensitivity to departures from distributional 
assumptions. The extent to which an estimator is able to 
withstand such deviations has been dubbed robustness (Box, 

1953) . The term robustness is here used in the narrow sense as 
applied only to distributional assumptions, though other 
standard assumptions could be invoked. Although conceptually 
distinct, distributional robustness and outlier resistance are 
essentially synonymous notions (Huber, 1981) . 

Robustness, or resistance as it was initially termed, was 
generally understood from the inception of the major statistical 
advances that occurred during the 19th and early 20th centuries 
but was not seriously examined until the 1950s (Staudte & 
Sheather, 1990; Stigler, 1973) . While some earlier theorists 
assessed the consequences of distributional nonnormality for 
hypothesis testing (viz., the robustness of validity ) , few 
explored the stability of power or of the length of confidence 
intervals (viz., the robustness of performance ) , though the 
latter "usually [brings] for free a satisfactory robustness of 
validity (but not vice versa)" (Huber, 1972, p. 1046) . Tukey 
counseled against the then-prevailing habit of "sweeping the 
dirt under the rug" (1960, p. 450) and ignoring comparisons of 
the relative efficiency of various point estimates. He 
admonished, "nearly imperceptible_non-normalities may make 
conventional relative efficiencies of estimates of scale and 
location entirely useless" (p. 474). 

Partly as a result of such admonitions, a number of robust 
analogs to traditional estimators, population parameters, and 
hypothesis-testing methods have been developed during the past 
40 years. Robust procedures typically retain the statistical 
interpretations associated with classical procedures but are 
more resistant to data nonnormality. 
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Nevertheless, applied researchers have been slow to adopt 
the newer methods. The hesitancy is due in part to a lack of 
knowledge. Over two decades ago, Bradley (1978) bemoaned the 
general poor treatment of assumption violation in elementary 
statistics textbooks noting that "reassuring complacency of 
tone, depreciating the consequences of assumption-violation, or 
using overly exuberant language to exult over claimed robustness 
seems to be endemic in statements about robustness" (p. 145) . It 
is no wonder that "most [modern] psychologists believe that all 
practical problems associated with statistical methods were 
solved by the year 1955" (Wilcox, 1998b, p. 60) . In spite of 
this widespread misconception, several have recently asserted 
the need for newer methods (e.g., Wilcox, 1996, 1998a; on the 
debate in general, see the May 1998 issue of the British Journal 
of Mathematical & Statistical Psychology , as well as Hampel's 
[1998] depiction of the current state of affairs) . 

The Pearson Product -Moment Correlation 

One common misconception involves the robustness of the 
product -moment correlation coefficient. The classical functional 
for measuring linear association between two continuously-scaled 
variables is defined as 



P *y 



S(x - n x ) (y - n y ) 

NG x Gy 



( 1 ) 



where p. x and p. y are the population means of the population 
variables x and y, respectively, and ct x and a y are the population 
standard deviations of x and y, respectively. The maximum 
likelihood sample estimator of rho, p, is estimated by r: 



P XY r xY 



s(x - X) (Y - Y ) 



ns x s Y 



( 2 ) 



where X , Y , s x , and s Y are the sample means and standard 
deviations of the sample variables X and Y, respectively. 

Sir Ronald A. Fisher defined the normal theory sampling 
distribution of r in samples of size n (1915, p. 508) to be 



f(r yv ) = 



2no x G Y ^Jl-p 2 



1 

1-P 2 



2p(x-|i jr ) (r— |iy) | (y-\i Y r 
2a x 2 2 c x a Y 2a y 2 



- 1 dxdy . (3) 
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With "large samples and moderate or small correlations" (Fisher, 
1958, p. 192) the sample correlation coefficient is distributed 

normally around p with variance 



var (r) = 



(1 - p 2 ) 2 
n - 1 



(4) 



The standard (1 - a) 100% confidence interval limits for p, 
again assuming sampling distribution normality, are calculated 
by 



Lower = r - z 1 _ o/2 * VvarCr) 



(5) 



and 



Upper = r - z a/2 * Vvar (r) 



( 6 ) 



where a is the inclusion probability. Here, z x _ al2 and z an are 
quantiles from the standard normal distribution. 

Fisher's z Transformation 

Because the sampling distribution of r is complicated when 
Pxy ^0, a transformation of r was proposed by Fisher (1915, 
1921) : 

p z = r z = z' = tanh" 1 r , p z = C, = tanh" 1 p , (7) 

or equivalently 

z ' = . 5 ln (1 + r) , £ = . 5 ln - + P) . (8) 

ln(l - r) ln(l - p) 



This inverse hyperbolic tangent transformation compensates for 
the nonnormality of the sampling distribution of r and 
approaches a normal distribution. 

A (1 - a) 100% confidence interval for z' is formed as 



z - 



• 5p 



A A 



- z J n 



3)‘ 



Lower 






n — Ij 



(9) 
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Upper = 




•5p ' 
n - lj 



+ z a (n - 3) 



( 10 ) 



where Za is the a quantile of the standard normal distribution 
and a the inclusion probability. The intervals can be reverse- 
transformed to the metric of r for ease of interpretation. 

Studies on the Robustness of r 

An extensive literature review of the robustness of r was 
conducted, but will not be presented here due to space 
limitations (for details, see King, 2000) . To summarize, 
approximately two-thirds of the reviewed studies expressed 
reservations for the use of r under some nonnormal conditions. 
That percentage was even higher when only more recent studies 
were considered. Due to the availability of more powerful 
simulation capabilities beginning around 1960, studies conducted 
after that date should be weighed more heavily. 

Under independence (i.e., p = 0) most reported the sampling 
distribution of r to be robust to bivariate nonnormality (Duncan 
Sc Layard, 1973; Dunlap, 1931; Gayen, 1951; Havlicek & Peterson, 
1976, 1977; Nair, 1941; Norris & Hjelm, 1961; Pearson, 1931, 

1932) , unless mixed or contaminated distributions were under 
review (Devlin, Gnanadesikan, & Kettenring, 1975; Duncan & 
Layard, 1973; Edgell & Noon, 1984; Kowalski, 1972; but cf. 
Pearson, 1929; Srivastava & Awan, 1984) . However, the majority 
of studies employed distributions that diverged minimally from 
the bivariate normal condition, especially the earlier studies 
(e.g., Pearson, 1931, reported skewness = .99 and kurtosis = 

3.83 for his most excessive condition!). For more extreme 
departures like severe skewness (Baker, 1930), the L 
distribution (Blair & Lawson, 1982) , or the Cauchy distribution 
(Edgell & Noon, 1984), r was not found to be robust. 

When p does not equal zero and the bivariate surface is 
nonnormal, r is likely biased, sometimes inordinately so. Of all 
the reviewed studies which included a dependence condition, only 
two (Pearson, 1929; Zeller & Levin, 1974) found r to be 
completely robust; the majority expressed reservations for at 
least some situations (Cheriyan, 1945; Chesire et al . , 1932; 
Devlin et al., 1975; Duncan & Layard, 1973; Gayen, 1951; 

Haldane, 1949; Hey, 1938; Kowalski, 1972; Norris & Hjelm, 1961; 
Rider's [1932] results, though not his comments). 

Numerous researchers hold that all uses of r are completely 
robust to distributional assumptions. This misunderstanding may 
have arisen due to the selective literature reviews conducted in 
some prominent studies. For example, Edgell and Noon (1984) 
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failed to survey robustness studies conducted by Baker (1930) 
and Blair and Lawson (1982) . Both established the inadequacy of 
r under certain nonnormal conditions, including the case of p = 
0. In addition, certain literature reviews neglected to cite any 
studies demonstrating the non- robustness of r (e.g., Zeller & 
Levine, 1974) or failed to accurately represent such findings in 
summarizing their literature reviews (e.g., Havlicek & Peterson, 
1976, p. 1321) . 

Further, when quantitative measures of resistance are 
applied to p, additional problems surface. For example, the 
influence function and breakdown point of p suggest that even a 
single pair of outlying scores can render the parameter 
virtually meaningless for quantifying the bivariate relationship 
underlying the majority of data points (Devlin, et al . , 1975; 
Wilcox, 1993 ) . 



Robust Measures of Correlation 

The literature on robust correlations is scattered and 
scarce, especially regarding comparative simulation studies. 
Though the quadrant correlation has been available for a century 
(Blomqvist, 1950; Sheppard, 1899) , the majority of resilient 
measures of association have been introduced only recently. Of 
these, two appear particularly promising. Along with possessing 
properties that curb the influence of distributional anomalies, 
both the Winsorized correlation (Devlin et al., 1975; 
Gnanadesikan & Kettenring, 1972; Wilcox, 1993) and the 
percentage bend correlation (Wilcox, 1994) yield interpretations 
analogous to the popular Pearson r. Yet few have explored these 
newer correlation coefficients, notably with respect to 
generating accurate confidence intervals. Likewise, little is 
known about the latent sampling distribution of each index. 

The Winsorized Correlation 



Winsorization involves ordering the scores in a 
distribution and then deleting the y smallest values and setting 
them equal to X (y + x) , and deleting the y largest values and 
setting them equal to X (n - y > • In other words, potential outliers 
are removed from each tail of the distribution and replaced with 
the most extreme score remaining in that tail. Delvin et al. 
(1975) present the equations and proofs for calculating a sample 
Winsorized correlation coefficient (r w ) based on the Winsorized 
sample mean and the Winsorized sample variance. This statistic 
can be conceptualized as the application of the ordinary Pearson 
formula to two Winsorized score distributions. 
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Wilcox (1993) evaluated Type I error probabilities and 
power calculations for the ordinary r versus r w via computer 
simulations. The Winsorized correlation bettered r in terms of 
error rates across a range of conditions. For the bivariate 
normal case r evidenced superior power; however, r w demonstrated 
greater power when nonnormal conditions were explored. Wilcox 
could not determine an optimal method for generating accurate 
confidence intervals for the dependence condition, including 
application of the bootstrap (though no results were offered) , 
but some success was had in transforming r w to a regression 
coefficient . 

The Percentage Bend Correlation 

Wilcox (1994) suggested a percentage bend correlation as a 
robust measure of association. This correlation is based on the 
percentage bend measure of location and the percentage bend 
midvariance. Computational details and proofs are discussed in 
Wilcox (see also King, 2000) . Outlying scores are defined via a 
constant labeled p. As with the Pearson and Winsorized 
correlations, the percentage bend correlation will equal zero 
(or come very close) under independence and fall between -1 and 
+1 . Many alternative robust correlations do not meet these 
criteria . 

Few studies have investigated this new correlation 
coefficient. Wilcox (1994) defined the measure and provided a 
test for independence. He also compared tests of statistical 
significance for r, r w , and rpb- In comparison with r, tests of 
rpb more closely mirrored expected Type I error rates for samples 
of size 10 and 20 with a nom inai = -05, except under bivariate 
normality. In terms of power evaluations, tests of £p b compared 
favorably with tests of r and generally outperformed tests of r w 
under nonnormal conditions, so long as P = .1 was used to compute 
rpb • However, the Pearson correlation edged out the others under 
distributional normality. 

Bootstrapping Confidence Intervals 

For statistics with no known sampling distribution, Efron's 
(1979, 1982) bootstrap has proven to be effective in a variety 
of contexts. The conjecture is that the sampling distribution of 
a statistic can be approximated by the distribution of a large 
number of resampled estimates of the statistic obtained from a 
single sample of observations. The distribution of resampled 
estimates forms an empirically-derived sampling distribution 
from which confidence intervals or other indices may be 
estimated. 
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Efron settled on the name from the mythical Baron von 
Munchausen's unique manner of escape from a deep lake: He pulled 
himself up by his bootstraps (Efron & Tibshirani, 1993). Tukey' s 
suggested nomenclature was equally descriptive. According to 
Efron (1979, p. 25), Tukey favored the term shotgun because the 
method "can blow the head off any problem if the statistician 
can stand the resulting mess." 

The bootstrap sampling distribution may be employed either 
for inferential purposes (e.g., testing hypotheses about 
parameters) or for description (e.g., estimating a likely range 
for some parameter at a given confidence level or result 
stability or replicability) (Thompson, 1993) . The technique is 
especially valuable when confronted with statistics having no 
known sampling distribution. 

Due to its efficacy, bootstrapping has become fashionable 
in many fields. Wilcox (1997a, p. 45) stated that over 1,000 
journal articles on bootstrapping have already seen publication! 
Further, "an almost bewildering array" of variants of bootstrap 
confidence intervals have been advanced (Hall, 1988, p. 927) . 
These vary in the accuracy with which the bootstrap-generated 
interval spans the true interval. Accuracy is also contingent on 
the statistic under examination: No single bootstrapping routine 
is optimal across all statistical techniques. 

The present investigation considered four of the more 
popular, well-studied procedures: the percentile bootstrap , the 
adjusted bootstrap , the bias-corrected bootstrap , and the bias- 
corrected and accelerated bootstrap . 1 Though some of the newer, 
more complex methods appear promising, it seems that most remain 
largely in the developmental stage or require unreasonable 
execution time. Additional research into the theoretical 
properties and practical usage of these newer approaches is 
still needed. 

Description of the Bootstrap 

Let x be a population of observations and X a random sample 
of n observations drawn from population x. A parameter 
summarizing the population will be denoted as 0, and the sample 



sampling distribution for the population. 

In Monte Carlo bootstrapping experiments, each of m 
simulated data sets (samples) are resampled B times, with each 
resample being of size n (Lunneborg, 2000) . Resampling involves 
repeatedly drawing observations from a sample with replacement 
until n is reached. Next, a bootstrap estimate of the parameter 
is acquired for each resample. The bootstrap estimate is 



estimate as 0 . The symbol 
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represented by 0* . The process is then repeated B times yielding 
a bootstrap sampling distribution of estimated parameters 

(denoted f( 0*)) from which confidence intervals and other 
information can be derived. In theory the bootstrap sampling 
distribution should mimic the true sampling distribution of the 
statistic . 2 

Types of Bootstrapped Confidence Intervals 

The percentile bootstrap (Efron, 1979) ranks as the most 
popular bootstrap procedure (according to Hall's, 1988, informal 
inquiries) , although "the great majority of non-technical 
statistical work. . .does not make it clear which. . . [technique] is 
employed" (Hall, p. 927) . This "arch" bootstrap (Sievers, 1996, 

p. 381) entails first calculating 0 * for each resample. The 
distribution of conditionally independent statistics, 0*,...,0‘, 
are then ordered by value. An a-level confidence interval 

includes all the values of 0 between the oc/2 and the 1 - a/2 
percentiles of the (ordered) bootstrap sampling distribution, 

Another option is to widen the traditional bootstrap 
confidence band by the factor [ (n+2) / (n-1) ] 1/2 (Efron, 1982). 

Both endpoints of the confidence interval are adjusted an equal 
amount. Following Strube (1988), this approach will be here 
referred to as the adjusted bootstrap . 

Third, Efron (1981, 1982, 1985) offered the bias-corrected 
(BC) bootstrap to correct for problems with the percentile 
bootstrap. The BC method allows for the possibility that the 

distribution of 0* - 0 is not centered on zero but is distributed 

/•V 

around a constant z n G~ , where g~ is the standard error of 0 . In 

y 0 0 

other words, the procedure rests on the assumption that there 
exists some monotonic transformation of whose differences are 

distributed in some known way, usually as a normal variate 
(hence the constant being labeled z 0 ) . The constant must next be 
estimated and applied to the bootstrap sampling distribution. 
Because the method is transformation invariant, it is not 
required that the specific transformation function be known, 
only that one exists (Mooney & Duval, 1993) . 

Efron (1987) proposed the bias corrected and accelerated 
(BC a ) bootstrap to stabilize the variance of the bootstrap 
distribution (i.e., remove the tendency for the variance to 
accelerate across values in the sampling distribution) , in 
addition to incorporating the bias correction described above. 
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Extensive calculations are involved with this procedure (but see 
Lunneborg [2000, pp. 164-165] for a readable description) . 

In spite of the involved computational demands, the BC a 
holds much promise as a bootstrap correction procedure for 
several reasons. In terms of coverage error, Efron (1987) and 
Hall (1988) proved that the BC a method yields second-order 
correctness under a wider class of problems than the BC 
bootstrap. Second, the BC a bootstrap is transformation invariant. 
Moreover, the specific transformation may remain unknown while 
employing the BC a . 

However, there are potential drawbacks to the method. 

First, it is assumed that a transformation exists that yields a 
distribution of differences with a known shape (Mooney & Duval, 
1993) . Second, Hall (1988) lamented the fact that standard 
normal tables must be consulted in applying the method. But with 
the advent of more complex statistical computer programs, this 
deterrence is soon becoming obsolete. Third, the acceleration 
constant proves troublesome to calculate in many cases (DiCiccio 
Sc Romano, 1989) . Finally, simulations conducted by Hall suggest 
that the BC a bootstrap may produce abnormally short intervals 
with small samples when wide nominal coverage is of interest. 

Research on Bootstrapping r 

A thorough literature review was conducted of studies that 
applied bootstrapping to the Pearson correlation. These results 
will be briefly summarized (for details, see King, 2000) . 
Although results were somewhat mixed, bootstrapping appears to 
improve upon the normal theory intervals both in terms of 
probability coverage and interval accuracy for most mixed normal 
and nonnormal distributional conditions (Efron, 1988; Hall, 
Martin, & Schucany, 1989; Lunneborg, 1985; Rasmussen, 1988, 

1989; Sievers, 1996; Strube, 1988). But under bivariate 
normality the standard intervals may be preferable (Rasmussen, 
1987) . While some studies reported problems with statistical 
significance levels when stringent nominal alphas were employed, 
others found the opposite effect (cf. the 1987 and 1989 
experiments both conducted by Rasmussen!). Methodological 
variations may be the source of such discrepancies. 

In comparing the various bootstrapping procedures, it seems 
that the adjusted intervals (i.e., adjusted, BC, and iterated 
bootstrap 3 ) formed more accurate confidence intervals and 
probability levels than did the percentile and percentile-t 3 
methods (e.g., Hall et al . , 1989; Strube, 1988). Results for the 
latter two techniques differed in that the percentile tended to 
undercover while the percentile-t tended to overcover (Hall et 
al., 1989; Rasmussen, 1987; Sievers, 1996). The ordinary 
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percentile-t was found to effect relatively large standard 
errors (Hall et al., 1989; Sievers, 1996), though a modified 
version performed well. Finally, a study by Wilcox (1991) in 
which no bootstrap method achieved accurate intervals under 
variable dependence seems to be somewhat of an anomaly. 

While it seems that at least some problems with testing and 
estimating p can be alleviated by applying the bootstrap, others 
cannot. Even if a bootstrap method proves valuable in obtaining 
proper confidence intervals for p, one may not wish to use a 
correlation coefficient that is so affected by outliers and 
related distributional departures from normality. 

Research on Bootstrapping Robust Correlations 



Little research has been done on the application of the 
bootstrap to robust correlations. Wilcox (1997b) proposed a test 
for the independence of p > 2 random variables using the 
percentage bend correlation. Although this multivariate 
application is not germane to our study, he also examined the 
computation of confidence intervals for p P b in the bivariate 
case. A percentile bootstrap method was employed in which B = 

399 resamples were simulated from theoretical distributions 
demonstrating varying levels of skewness and kurtosis (in some 
cases reaching very extreme conditions) . One thousand samples 
were acquired per condition. Population correlations of p = 0, 

.3, .6, and .9 were examined. Bootstrapping was also applied to p 

for comparative purposes. 

Because the percentage bend correlation does not exactly 
equal p under dependence, approximate populations were generated 
by drawing 100,000 pairs of samples and then calculating p pb for 
each. The bootstrap confidence intervals for p pb more closely 
spanned the desired interval for practically all conditions and 
sample sizes than did those computed for p. The latter diverged 
considerably under extreme nonnormality. For example, with n = 

50, p = .9, a = .05, and high kurtosis, the confidence interval 
for p pb yielded an interval with probability coverage of 1 - a = 
.966, while the interval for p only covered at .253. Though 
results were not provided, Wilcox assessed conditions in which 
one or both marginal distributions were skewed, sometimes in 
opposite directions. These outcomes were (reportedly) promising 
as well. 

Wilcox (2001) applied the percentile bootstrap, using B = 
599, to two measures of association: p and Spearman's rank 
correlation (p s ) . He also examined a modified percentile method 
in conjunction with p, and a nested bootstrap applied to all 
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three correlation coefficients. As regards Type I error, Wilcox 
found that the nested bootstrap with p produced acceptable error 
probabilities, and either bootstrap method worked well with the 
robust correlations. However, none of the methods resulted in 
satisfactory confidence intervals. 

Importance and Purpose of the Study 

Given the current emphasis in many fields on reporting 
effect size measures and confidence intervals for point 
estimates (Thompson, 2001; Vacha-Haase, T., Nilsson, J.E., 

Reetz, D.R., Lance, T.S., & Thompson, B. , 2000; Wilkinson & APA 
Task Force on Statistical Inference, 1999), estimation is 
clearly an important issue to consider in addition to 
inferential testing. But at the current level of research, it is 
unknown which bootstrap technique should be exercised in 
constructing confidence intervals for robust measures of linear 
relationship . 

Although the Winsorized and percentage bend correlations 
have been compared with each other and with the ordinary 
correlation in terms of committing Type I errors (Wilcox, 1993, 
1994, 1997b), only Wilcox (1997b, 2001) has examined the 
accuracy of bootstrapped confidence intervals for robust 
measures. Clearly, more research is needed in this area. So the 
primary purpose of the present study was to compare various 
methods of bootstrapping confidence intervals for each of the 
above-mentioned robust correlations and p. For completeness, the 
Fisher- transformed correlation will be included, although it 
frequently fails to produce even asymptotically correct results 
(Duncan & Layard, 1973) . 

Research Questions 

1. How do confidence intervals for the robust correlations 
p w and p P b compare with those for p and its transform in terms of 
accuracy and stability? 

2 . Which bootstrap method provides the most accurate and 
stable confidence intervals for p w ? for p P b? 

3. With small samples which bootstrap method provides the 
most accurate and stable confidence intervals for p w ? for p pb ? 

4. Under extreme distributional conditions which bootstrap 
method provides the most accurate and stable confidence 
intervals for p w ? for p P b? 
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Method 



General Simulation Procedure 



Due to its computational efficiency in bootstrapping 
(Wilcox, 1997b) , all computer simulations were conducted using 
the S-Plus statistical package running on a 600 MHz Pentium III 
computer with 256 Mb RAM. The procedure for the simulations was 
as follows: 

1. Randomly generate N = 1,000,000 observations from a 
population with known characteristics (i.e., constrained through 
the simulation procedure to have certain parametric properties) . 
This is the derived population . The step is necessary because 
the Winsorized and percentage bend correlations will not usually 
exactly equal p when dependence exists, and a simulated 
population allows comparisons between p and the robust 
correlations . 

2. Calculate the parameters p, p z , p w , and p P b for the 
population . 

3. Randomly select without replacement a sample of size n 
from the population. 

4 . Calculate the statistics r, r ZJ _ r w _j_ and rp b for the 
sample . 

5. Randomly select with replacement a resample of size n 
from the sample. This is one bootstrap sample. 

6. Calculate the statistics r\ r z *, r w *, and rp b * for the 
resample, where the asterisk denotes a bootstrap estimate. 

7. Repeat Steps 5 and 6 a total of B = 500 times forming 
500 bootstrap samples. 

8. Calculate 95% confidence intervals for p, p z , p w , and p pb 
using each of the four bootstrap procedures. 

9. Repeat Steps 3 through 8 a total of m = 100 times 
forming 100 samples. 

10. Repeat Steps 1 through 9 for each condition of interest 
(described below) . 

While one would normally wish to secure at least 1,000 
bootstrap samples (for a fuller discussion, see King, 2000) , the 
extensive computations required for the present study (i.e., 
estimation of confidence intervals via four bootstrap methods 
for each of four correlation coefficients per sample) precluded 
this as a possibility. With B set to 3,000, for example, results 
from only about 25 samples were acquired in eight hours time. 
With 100 samples needed per condition and around 150 total 
conditions to be evaluated, employing such a large number of 
resamples became impractical. 
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Instead, it was decided that 500 resamples would yield 
relatively accurate confidence intervals, at least for 
comparative purposes. Each bootstrap method should "suffer" 
equally due to the small value of B, assuming that the accuracy 
of each is dependent on sample size to the same extent. 
Similarly, it would have been desirable to generate 1,000 
samples, but m = 100 was settled upon due to computational time 
restraints . 



Population Characteristics 



Real data often demonstrate excessive distributional 
nonnormality (Bradley, 1977; Micceri, 1989; Rasmussen, 1986; 
Stigler , 1973; Wilcox, 1990) . In fact, after reviewing 440 
empirical studies, Micceri (1989) encountered only 15.2% of the 
score distributions with both tails having weights at or about 
those for the Gaussian distribution. Various distributional 
abnormalities can moderate the accuracy of a bootstrap procedure 
for a given statistic (Hall, 1988; Wilcox, 1997b) . Thus for the 
present study it was decided to vary distributional shape, 
strength of correlation, and sample size in evaluating bootstrap 
approaches. Contaminated and mixed distributions were also 
investigated . 

Following Wilcox (1997b) the g and h distribution (Hoaglin, 
1985) was adopted to alter the skewness and kurtosis of each 
population. Hoaglin' s distributions allow one to construct 
marginals according to four general shapes: normal, symmetric 
with a heavy tail, asymmetric with a light tail, and asymmetric 
with a heavy tail. Increasing g skews X, and h influences 
kurtosis. When both g and h are set to zero, X has a standard 
normal distribution. 

The method consists of initially generating observations 
from a standard normal distribution, Z, for each variable. X is 
then set to 



X 



V z - 1 
. sr 



V z ’ /2 

J 



(82) 



for g > 0; otherwise 



X 



Ze 



hZ 2 / 2 



(83) 



to prevent division by zero. 

Table 1 lists summary statistics using several illustrative 
populations each having 1,000,000 observations for the values of 
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g and h that were employed in the present study. A wide range of 
marginal distributional shapes were investigated. 

Several sample sizes were examined: n = 20, 50, 100, and 
250. The larger ns represent sample sizes greater than are 
typically available to most social scientists (cf. Thompson, 

1999; Thompson & Snyder, 1997) , and samples smaller than n = 20 
should probably not be employed with robust correlations. 

Strength of linear relationship was also varied. Population 
rhos took on values of 0, .4, or 8. However, these values will 

only be met for the Pearson correlation because the Winsorized 
and percentage bend correlations will not necessarily equal p 
under dependence . 

In addition, the effect of mixed and contaminated 
distributions was investigated. Three mixed distributions were 
created for each of two nonnormal distributional conditions. 
Various population correlations (i.e., p » 0, .3, .5, .7) were 

constructed under two distributional conditions (slight 
kurtosis: marginal distributions set to g = 0, h = .1; extreme 
nonnormality: marginals set to g = .5, h = .3). The classic case 
of contamination is formed by combining a normal marginal 
distribution (X) and its square (X 2 ) . This is a stringent test of 
any index because the distributions are completely dependent but 
also uncorrelated (Edgell & Noon, 1984) . This condition was 
evaluated with n set to 100. 

Comparative Criteria 

Type I error . Type I error rate, interval ratio, bias, and 
standard error estimates were used for comparing the performance 

of the bootstrap methods. Observed Type I error rates were 

- h <* \ 

recorded using the equation p obs = m # (0.=.,. s e £ 9 W ,|, where m 

A 

= the number of samples drawn, and the 0 s are the estimated 
lower and upper limits of the confidence interval (Sievers, 

1996) . The standard error of each alpha rate is equal to 

JsQ - a) / m , where s is the statistical significance level and m 
the number of samples drawn from the simulated population 
(Rasmussen, 1989). Values within ±2SE (i.e., 

V- 05(1 - . 05) / 100 = .022 * 2 = .044) may be considered within 
sampling error, though the discrepancy between a given error 
rate and the nominal value is not especially relevant to our 
purposes because the present concern is in locating the 
bootstrap method yielding the most accurate error rate per 
correlation type, regardless of the value's distance from the 
nominal rate. Nevertheless, the demarcation is useful in 
revealing bootstrap methods that be particularly inaccurate. 
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Interval ratio . Interval accuracy is determined by 
comparing the bootstrap confidence intervals to "true" 
confidence intervals. However, the distribution of m sample 
statistics obtained from Step 9 is not sufficient to constitute 
a Monte Carlo estimate of the true confidence intervals. 
Therefore, 10,000 samples were drawn from each derived 
population. This process was repeated for each sample size 
condition. The distribution of correlation coefficients 
calculated form each sample serves as an estimated "true" 
sampling distribution from which the quantiles Q a / 2 and Qi . a /2 can 
be obtained. These quantiles are empirical estimates of the 
"true" limits of the sampling distribution of 0 and were used as 
a standard against which to compare several indices to be 
described next (see King, 2000, for details) . 

A modification of a ratio proposed by Efron (1988) was 
applied to compare interval lengths. Efron's index entailed 
dividing the length of the parametric interval of r by a 
bootstrap interval (see Equation 52) . As it was not the aim of 
the present study to compare bootstrap to parametric intervals, 
the length of each bootstrap interval was divided by the length 
of the "true" (Monte Carlo-estimated) confidence interval. While 
this ratio indexes the extent to which the intervals span an 
equal distance, it does not necessarily quantify the discrepancy 
between bootstrap endpoints and those of the "true" interval. 

Two intervals could conceivably have identical lengths but no 
overlap at all. 

Bias . An even more useful gauge is bias. Bias quantifies 
the average discrepancy between the sample estimates and the 
parameter. If a negative value is obtained, the estimator 

underestimates the parameter on average. If BIAS (§) = 0, the 

A /V 

estimator 0 is unbiased and the sampling distribution of 0 is 
centered on 0 . Because the present study was primarily designed 
to compare confidence intervals and not point estimates, the 
ordinary bias formula was modified such that each "true" 
endpoint was compared with the bootstrap-estimated endpoint: 



SP 



■'lower (i) ^lower 



+ 



■'upper (i) ^upper 



Interval Bias = 



m 



(ID 



where E lower and S upper are the "true" (1 - a) 100% lower and upper 

confidence intervals for 0 , and H, ower and H upper are the bootstrap- 
estimated intervals. Because the parameter of interest in this 
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case is not the population correlation coefficient but the 
endpoints of the "true" interval, a new symbol is needed for 
referencing the theoretical endpoints. This was arbitrarily set 
to be S (Xi) so that theta may be reserved for the usual 
population correlation coefficient. 

Equation 11 computes the average of the absolute value of 
the differences between each bootstrap-estimated endpoint from 
the corresponding Monte Carlo-generated "true" endpoint. Such a 
bias indicator would seem to be particularly relevant in 
evaluating confidence intervals. While the distinction between 
upper and lower endpoint bias is obscured due to the sumation 
performed in the equation, such is rarely of interest. 

Efficiency . The standard error (SE) of a statistic is 
defined as 



and measures the spread of the estimates. This index is often 
referred to as a measure of efficiency . Equation 12 was modified 
similarly to that described for the bias measure. 

Comparative Proctedures 

Correlational analysis . Although not typically reported in 
simulation studies, Type I error rates for the four bootstrap 
methods could be correlated to further explore the variable 
relationships. These correlations measure the consistency with 
which error rates for any two bootstrap methods covary across 
the range of simulation conditions. However, one should not 
infer from a large correlation that error rates for the two 
methods are identical. In fact, rates for one bootstrap 
procedure may run systematically higher or lower than another, 
but if the rates for each rise and fall monotonically in the 
same fashion across simulation conditions, then the correlation 
coefficient will fall near 1.0. Incidentally, this use of the 
Pearson r illustrates the value of the ordinary correlation for 
situations in which one does not wish to remove outliers. 

Analysis of Variance . Although previous studies have 
usually compared error rates in an informal manner, a more 
formal statistical analysis was needed to process the large 
number of indices obtained. Analysis of Variance (ANOVA) was 
proposed as a means of quantifying the sources of variation 
affecting error rates. There are five systematic (nonrandom) 




2 



(12) 
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variance components that could influence error rate: correlation 
type, bootstrap method, distributional shape (g/h combination) , 
sample size (n) , and strength of population bivariate 
relationship (p) . 

One might be tempted to incorporate all of these factors in 
a five-way ANOVA as independent variables. Ignoring the 
interpretive difficulties inherent in such a proposal, such a 
partitioning is not possible because there is only one 
observation (error rate value) per cell of the balanced design. 
Instead, separate ANOVAs were computed in hopes of untangling 
the variable relationships. In particular, ANOVAs were computed 
in which (a) bootstrap method and correlation type served as 
predictors; (b) the distributional indicator ( g/h combination) 
was added to the variables listed in (a) ; (c) strength of 

population correlation (p) was added to (a) ; and (d) sample size 
was added to (a) . The drawback to such a course is that higher- 
order interactions cannot be assessed and quantified. 

Results 

1 . How do confidence intervals for the robust correlations p w and 
Ppb compare with those for p and its transform in terms of 
accuracy and stability? 

Tables depicting alpha rates, bias, and interval ratios 
were constructed by averaging across the m samples for each 
combination of correlation type and bootstrap method and were 
broken down by distributional condition, sample size, and 
strength of population correlation. The data were also collapsed 
across the nine distributional conditions for easier viewing, 
but any performance attributable to distributional shape is 
masked by such a summary table. Tables 2-6 and Figures 1-2 
display representative results. Disaggregated data and fuller 
explanations can be found in King (2000) . 

Confidence intervals formed for the robust correlations 
outperformed those for r and r z across most of the specified 
criteria. Type I error rates generally fell closer to the 
nominal value, bias and efficiency were minimal, and "true" 
interval lengths were more faithfully reproduced by r w and rp b . 
Specifically, with nonnormal marginal distributions the robust 
rs frequently provided increased accuracy in error and bias 
rates. With normal distributions all four correlations 
functioned similarly in terms of Type I probability, though r 
and r 2 bettered r w and rp b according to bias, at least when small 
samples were involved. 

These findings for bias are not unanticipated because the 
robust calculations are intended to act as correctives when 
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nonnormal conditions arise. But the same trend should have 
emerged for Type I error rates . 

The robust measures clearly surpassed r for both error 
probability and bias rate when contaminated and mixed 
distributions were examined. The divergence between robust and 
nonrobust methods was most noticeable under the extremely 
nonnormal g = .5, h = .3 distributional conditions. The 
differential performance was particularly evident in the bias 
index. 

It should also be noted that neater, more consistent 
results were realized with the bias criterion than with Type I 
error rate, in general. This is probably due to the dichotomous 
nature of the latter (i.e., a given interval either does or does 
not enclose the parameter) , in contrast to the ratio- scaled bias 
index. These dynamics may be responsible for some of the 
unexpected outcomes involving Type I error. 

Efficiency rates varied little across correlational 
measures; all four evidenced similar stability. But when 
disparities arose, the robust correlations were more efficient. 

In terms of interval length the Pearson statistics 
consistently underestimated the "true" (Monte Carlo simulated) 
endpoints, more so under nonnormal conditions. At times, such 
intervals were little more than half the "true" length. Interval 
lengths for the Winsorized correlation were a bit longer than 
desired, while the percentage bend correlation closely mimicked 
the "true" intervals in almost every instance. 

2 . Which bootstrap method provides the most accurate and stable 
confidence intervals for p w ? for p pb ? 

No bootstrap technique emerged as unmistakably superior 
across a majority of conditions, though the BC and ordinary 
percentile methods yielded slightly more accurate intervals in 
some cases. The BC a yielded results similar to the percentile and 
BC procedures in all but a few situations, while the adjusted 
bootstrap intervals were, by and large, unacceptable. 

3 . With small samples which bootstrap method provides the most 
accurate and stable confidence intervals for p w ? for p P b? 

Bootstrap procedures were not found to differ appreciably 
in terms of Type I error rate or bias for the majority of the 
simulation experiments. Consequently, one method cannot be 
selected as optimal for any specific condition, including sample 
size. Nonetheless, the adjusted bootstrap can be ruled out as a 
contender due to its relative inaccuracy, especially in terms of 
error probabilities. In addition, the BC a frequently evidenced 
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slightly elevated bias and error rates with r and r z and thus 
should probably not be applied to the ordinary correlation. 

A minor trend in stability emerged at small sample sizes as 
interval lengths for the adjusted bootstrap ran wide, next was 
the BC a , then the BC, and finally the ordinary percentile. These 
differences subsided with large samples in place. This trend was 
at least partly unexpected in that the BC and BC a techniques 
should not fall systematically shorter or longer than the 
percentile method: The bias and variance inflation corrections 
may alter the intervals in either direction. But the adjusted 
bootstrap would be expected to form wider intervals than the 
ordinary bootstrap because the calculations involve a sample 
size correction. These dynamics account for the similarity 
between the adjusted and percentile ratios when large ns were 
involved . 

Theory predicts that the bootstrap modifications should 
diverge in accuracy for each of the three indicators (error, 
bias, and interval ratio) . The BC a should stand as the most 
accurate approach, then the BC, and, lastly, the percentile 
method. The sample size adjustment would likely fall between the 
BC and percentile techniques. But no such orderly progression 
was observed for the first two criteria, other than the 
unacceptable confidence intervals produced by the adjusted 
bootstrap. 

4 . Under extreme distributional conditions which bootstrap 
method provides the most accurate and stable confidence 

intervals for p w ? for p P b? 

With skewness and kurtosis elevated no systematic 
relationships were observed, beyond those just discussed, 
between distributional shape and bootstrap method with reference 
to accuracy or stability. This was true for both correlation 
coefficients . 

Under contaminated and mixed distributions unstable error 
rates arose when BC a intervals were combined with r, but no 
substantial problems surfaced with the technique applied to 
robust correlations. Minor peaks in bias were similarly noted 
for the BC a , but only when joined with r or r z . Surprisingly, the 
adjusted bootstrap was as accurate as the percentile and BC 
techniques for these atypical conditions. 

Concluding Remarks 

Wilcox (2001, p.46) summarized his bootstrapping results 
thusly, "These results paint a rather complicated picture about 
which method... to use." The same could be said of results from 
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the present study. However, three conclusions can be drawn from 
this fairly extensive simulation study. First, as regards the 
bootstrap procedures, the four methods under investigation 
performed comparably. The sample size adjustment to the 
percentile bootstrap did not help matters and may have actually 
inflated Type I error, bias, and efficiency rates, while the 
more complex BC and BC a procedures failed to offer sizeable 
improvements in interval accuracy and thus are probably not 
worth the involved calculations . 

These findings are disappointing, in one sense, because the 
adjustments were expected to generate greater accuracy in 
forming confidence intervals. On the other hand, researchers can 
be more confident that the ordinary percentile method is capable 
of delivering relatively precise confidence intervals, at least 
as applied to the four correlational measures under review and 
assuming these results are generalizable . Of course there are 
numerous bootstrap procedures that have recently seen 
development, and it is always possible that one of these may 
yield more accurate intervals. 

One reason that the more complicated procedures did not 
surpass the percentile bootstrap may be due to the technical 
specifications of the simulation experiments. It was originally 
intended that 1,000 samples be drawn for each condition, but 
this number was reduced to 100 given excessive computational 
demands. While the goal here is not to fully reproduce a 
sampling distribution, more samples may be necessary to achieve 
stable asymptotic dynamics. 

For the same reason, the number of bootstrap samples was 
reduced from 2,000 to 500. However, in this case the objective 

is to model the sampling distribution f( 0) by means of f(§ ). Five 
hundred resamples are more than sufficient for estimating 
standard errors but too low to form tight confidence intervals. 
It was hoped that the sampling inaccuracy would influence all 
bootstrap procedures in an identical fashion. But if some 
methods are more dependent than others on the number of 
resamples necessary to form correct intervals (e.g., the BC a ) , 
then results will be biased against the technique. 

For example, a modest number of resamples may give rise to 
inadequate score representation in the tails of the sampling 
distribution. The tails are needed in developing accurate 
intervals, particularly when greater coverage probability is 
desired. A nominal alpha rate of .05 was employed throughout 
this experiment due to its popularity in social science 
research, but Efron (1988) warned against testing hypotheses 
using bootstrap resampling at such a strict probability level 
unless 1,000 or more bootstrap samples were acquired. This may 
be the cause of the Type I error rates not stabilizing at larger 
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sample sizes. Future studies should increase both m and B to 
larger sizes, possibly through the use of more efficient 
algorithms . 

A second conclusion involves the similarity between the 
Winsorized and percentage bend correlations in interval 
accuracy. While neither robust correlation outperformed the 
other, both were shown to be more resilient than r (and r z ) to 
distributional abnormalities. In fact, the robust measures 
compared favorably to r even under the bivariate normal 
conditions. Wilcox (1993, 1997b) reported findings very similar 
to these for both the Winsorized and percentage bend 
correlations, respectively. 

Third, Fisher's transformation did not appreciably improve 
either Type I error rate or bias associated with r, especially 
when averaged across all distributional conditions. It would 
seem that when bootstrapping the Pearson correlation, the 
transformation merely increases computational time without 
concomitantly affecting accuracy. But these results are in 
keeping with Seivers' (1996) conclusions about r z . 

In summary, the two robust correlations are to be preferred 
over the ordinary Pearson equations when resilience to outliers 
is needed. These new measures also compared favorably across 
normal distributional conditions and can be recommended for 
general use in cases where it is desired to obtain a measure of 
linear association reflecting the majority of the sample 
observations. None of the bootstrap methods under review were 
differentially accurate; each can be similarly endorsed, 
excepting the adjusted bootstrap. Perhaps future studies will 
produce interval-estimation techniques that afford heightened 
precision when applied to robust measures of correlation. 
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Footnotes 

1 These terras are not uniformly used in the literature. 
Actually, it is assumed that the difference distributions, 
f(§ - e) and f(§* -§), are equivalent. 

3 Not described in the present paper. 
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Table 1 



Illustrative 


Parameter Values for 


the g and h 


Distribution 




2 h 


M SD 


Skewness 


Kurtosis 
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0 . 000 


1.000 


0 . 001 


0.000 
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. 1 
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1.182 


0 . 004 


2.572 
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. 3 
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1.966 
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1.030 
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0.666 


.2 


. 1 
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1.227 


1 . 049 


4.871 


.2 


. 3 
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2 . 152 


11 . 955 


1,151.082 


. 5 
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1.211 


1 . 758 


5 . 918 


. 5 


. 1 


0.313 


1.505 


3 .391 


38.287 


. 5 


.3 


0.465 


3 . 035 


21.175 


1,413.162 


Note. Entries 


based on 


1,000,000 simulated observations. 


Skewness 


and kurtosis 


were computed 


as: skewness 


= m 3 / m 2 (m 2 ) 


kurtosis 
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/ m 2 2 ) - 
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Table 3 



Analysis of Variance 


for Type I Error 


Rate by Correlation Type 


and Bootstrap 


Method 


Source 


df 


F 


E 


Tl 2 


Model 


15 


11.028 


< .001 


.088 


CORR 


3 


50.511 


<.001 


. 081 


BOOT 


3 


2.735 


. 042 


. 004 


CORR * BOOT 


9 


.631 


.772 


. 003 


Error 


1712 


( . 002) 






Total 


1727 









Note . Value enclosed in parentheses represents mean square error. 
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Table 5 

Analysis of Variance for Bias by Correlation Type and Bootstrap Method 



Source 


df 


F 


E 


n 2 


Model 


15 


3.497 


<.001 


. 030 


CORR 


3 


15.558 


< . 001 


. 026 


BOOT 


3 


1.551 


. 199 


. 003 


CORR * BOOT 


9 


.125 


.999 


. 001 


Error 


1712 


(.010) 






Total 


1727 








Note. Value enclosed 


in parentheses 


represents mean 


square 


error . 
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Figure 1. Mean Type I error rate by correlation type and 
bootstrap method. Reference line indicates the nominal alpha 
rate of .05. 
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Figure 2. Mean bias by correlation type and bootstrap method. 
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